Introduction
Exploratory Data Analysis
Spatial Autocorrelation
Principal Component Analysis
Multivariate Kriging Model
Classic Machine Learning Model
Conclusion
Sources
Can identifying environmental factors help inform the risk of crime at population level?
Goal: to build models that predict crime from information about nearby places
Data
A total of 104,322 crimes were recorded in Barnet between April 2021 and March 2024.
In this period, anti-social behaviour was the most common type of crime, followed by violent crime.
However, in the last 12 months, violent crime has become the most common type of crime, which raises concerns.
Spatial autocorrelation measures the extent to which values in a spatial dataset are similar or dissimilar to those of their neighbours. It can be positive (clustering), negative (dispersion) or zero (random).
After creating distance-based weights, a global spatial autocorrelation test was conducted for each type of crime.
Only anti-social behaviour showed significant spatial autocorrelation (p-value < 0.001), with a Moran’s I statistic of 1.2e-03, indicating weak positive autocorrelation.
| Type | p-value | Moran's I |
|---|---|---|
| Anti-Social Behaviour | 0.000e+00 | 1.2e-03 |
| Violent Crime | 6.545e-01 | -2.0e-04 |
| Other | 5.789e-01 | -3.0e-04 |
| Vehicle Crime | 3.330e-01 | 2.0e-04 |
| Theft from the Person | 1.271e-01 | 2.3e-03 |
| Burglary | 1.349e-01 | 1.9e-03 |
Changing the distance threshold did not change the finding.
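The test described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the code behind the reported results: the function names, the binary distance-band weighting, the one-sided permutation test and the toy grid are all assumptions made for the example.

```python
import math
import random


def distance_band_weights(coords, threshold):
    """Binary weights: 1 if two distinct points lie within `threshold`, else 0."""
    n = len(coords)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= threshold:
                weights[i][j] = 1.0
    return weights


def morans_i(values, weights):
    """Global Moran's I: (n / S0) * sum_ij w_ij d_i d_j / sum_i d_i^2."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / s0) * (cross / sum(d * d for d in dev))


def permutation_p_value(values, weights, n_perm=199, seed=0):
    """One-sided pseudo p-value for positive autocorrelation."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between value and location
        if morans_i(shuffled, weights) >= observed:
            at_least_as_large += 1
    return (at_least_as_large + 1) / (n_perm + 1)


# Toy data (hypothetical): a 4 x 4 grid with high counts on the left half.
coords = [(x, y) for x in range(4) for y in range(4)]
values = [10 if x < 2 else 1 for x, y in coords]
weights = distance_band_weights(coords, threshold=1.0)
moran = morans_i(values, weights)
p_value = permutation_p_value(values, weights)
```

Re-running `distance_band_weights` with a different `threshold` is the kind of sensitivity check mentioned above.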
As only anti-social behaviour (ASB) demonstrated spatial autocorrelation, the specific points showing autocorrelation were then identified.
Most points were not statistically significant. However, some cold spots were detected in Burnt Oak and East Finchley.
| Quadrant | Count |
|---|---|
| High-High | 0 |
| Low-Low | 1,534 |
| High-Low | 4 |
| Low-High | 0 |
| Not Significant | 24,013 |
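The quadrant labels in the table follow a common convention: each point's standardised value is compared with the average standardised value of its neighbours. The sketch below assumes that convention. The neighbour lists, the toy counts and the `significant` flag (which in practice would come from a permutation test) are hypothetical.

```python
import statistics


def z_scores(values):
    """Standardise values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def spatial_lag(z, neighbours):
    """Average z-score of each point's neighbours (0.0 if it has none)."""
    return [sum(z[j] for j in nbrs) / len(nbrs) if nbrs else 0.0
            for nbrs in neighbours]


def quadrant(z_i, lag_i, significant):
    """Label a point by its own value and its neighbourhood's value."""
    if not significant:
        return "Not Significant"
    own = "High" if z_i > 0 else "Low"
    around = "High" if lag_i > 0 else "Low"
    return f"{own}-{around}"


# Toy data (hypothetical): five points on a line, one with a high count.
values = [1, 1, 1, 10, 1]
neighbours = [[1], [0, 2], [1, 3], [2, 4], [3]]
z = z_scores(values)
lag = spatial_lag(z, neighbours)
labels = [quadrant(z_i, lag_i, True) for z_i, lag_i in zip(z, lag)]
```

Under this convention, Low-Low points are the cold spots and High-Low points are isolated high values surrounded by low ones.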
Kriging is a method of spatial interpolation that models the spatial relationship between points, so that observations farther from the location being predicted receive less weight. A kriging model creates a prediction surface based on location coordinates and other predictors.
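The spatial relationship that kriging relies on is usually summarised by a semivariogram, which describes how dissimilar two observations tend to be as the distance between them grows. The following is a minimal sketch of an empirical semivariogram, not the model fitted here. The function name, the binning scheme and the toy transect are assumptions for illustration.

```python
import math
from collections import defaultdict


def empirical_semivariogram(coords, values, bin_width):
    """Mean of 0.5 * (z_i - z_j)^2 over point pairs, grouped by distance bin.

    Returns a dict mapping bin index k (distances in [k * bin_width,
    (k + 1) * bin_width)) to the average semivariance in that bin.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            h = math.dist(coords[i], coords[j])
            k = int(h // bin_width)
            sums[k] += 0.5 * (values[i] - values[j]) ** 2
            counts[k] += 1
    return {k: sums[k] / counts[k] for k in sorted(counts)}


# Toy data (hypothetical): values rise steadily along a straight transect.
coords = [(float(x), 0.0) for x in range(5)]
values = [0.0, 1.0, 2.0, 3.0, 4.0]
gamma = empirical_semivariogram(coords, values, bin_width=1.0)
```

A semivariance that increases with distance, as in this toy case, is what justifies giving nearer observations more weight.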
The multivariate kriging model gave a predicted count of ASB ranging from 1.3 to 24.4.
The highest prediction was in North Finchley, close to many shops on the high street. A couple more hot spots were identified in Golders Green and Colindale.
Overall, the kriging model had a root mean square error (RMSE) of 12.1.
The error ranged from -20.5 to 259.5, with a median of -0.1 and a mean of 3.3. Assuming that an absolute error below 10 counts as a reasonable estimate, the model predicts the number of ASB incidents fairly well.
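The error statistics quoted above can be computed as follows. This is a sketch under two assumptions: that error means observed minus predicted (suggested by the large positive errors at under-predicted hot spots), and that the tolerance of 10 is applied to the absolute error. The counts in the example are made up.

```python
import math
import statistics


def error_summary(observed, predicted, tolerance=10.0):
    """Summarise prediction errors, with error = observed - predicted."""
    errors = [o - p for o, p in zip(observed, predicted)]
    within = sum(1 for e in errors if abs(e) < tolerance)
    return {
        "rmse": math.sqrt(sum(e * e for e in errors) / len(errors)),
        "mean": statistics.mean(errors),
        "median": statistics.median(errors),
        "min": min(errors),
        "max": max(errors),
        "share_within_tolerance": within / len(errors),
    }


# Toy data (hypothetical): three well-predicted points and one missed hot spot.
observed = [2, 5, 3, 40]
predicted = [3, 4, 3, 10]
summary = error_summary(observed, predicted)
```

In the toy data a single missed hot spot pulls the mean and RMSE well above the median, the same pattern as a median error near zero alongside a larger mean.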
However, the model was not able to capture ASB hot spots. The greatest error was observed along the North Circular Road in East Finchley, near St Pancras and Islington Cemetery. While the model predicted about 3.5 ASB crimes based on the surrounding places, 163 ASB crimes were recorded there over three years.
Before building the kriging model, principal component analysis (PCA) was performed. Rather than building a kriging model with 52 predictors, seven principal components, which together retained about 70% of the information in the predictors, were used to make predictions.
The training error is extremely small and can be treated as negligible.
The test error, on the other hand, varies with the number of principal components.
Keeping 4 or 6 components appears to be a reasonable choice, as the RMSE is fairly low in both cases.
For easier interpretation, 4 components were chosen for the PCA model.
Principal Component Analysis (PCA) is a linear dimensionality reduction method. The data are linearly transformed onto a new coordinate system whose axes, the principal components, are ordered by how much of the variation in the data they capture.
Dimensions 1 to 7 together explained about 70% of the total variance.
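A figure such as "about 70% of the total variance" comes from the cumulative share of the component variances (eigenvalues). The sketch below shows how the number of components for a given target could be read off. The function name and the eigenvalues are hypothetical and are not taken from this analysis.

```python
def components_for_variance(eigenvalues, target=0.70):
    """Smallest number of leading components whose cumulative share of
    variance reaches `target`, together with the share actually reached."""
    total = sum(eigenvalues)
    running = 0.0
    ordered = sorted(eigenvalues, reverse=True)
    for k, eigenvalue in enumerate(ordered, start=1):
        running += eigenvalue
        if running / total >= target:
            return k, running / total
    return len(ordered), running / total


# Toy eigenvalues (hypothetical): the first two components dominate.
n_components, share = components_for_variance([5.0, 3.0, 1.0, 1.0], target=0.70)
```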
| Dimension | Top 10 Contributors | Main Theme |
|---|---|---|
| 1 | Distance to nearest: car repair shop, electronics shop, money exchange and transfer, garages, vet, houseware shop, gas station, liquor shop, and launderette | Urban outskirts |
| 2 | Distance to nearest: bakery, bank, clothes shop, lawyer’s office, bridge, real estate agent, post-secondary institution (e.g., college or university), ATMs, post office, and convenience store | High streets |
| 3 | Distance to nearest: bar, clinic, doctor’s office, houseware shop, aesthetics shop (beauty), lawyer’s office, post-secondary institution, hospital, post office, and vet | Healthcare setting |
| 4 | Distance to nearest: post depot, garage, graveyard, post office, warehouse, car wash, car dealer, aesthetics shop, and clothing shop | Urban outskirts |
Linear regression, support vector machine and random forest models were compared using five-fold cross-validation; the random forest model had the lowest error as measured by RMSE.
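The comparison procedure can be illustrated with a small, dependency-free sketch of k-fold cross-validation. The two candidate models here (a mean predictor and a one-feature least-squares line) are stand-ins chosen for brevity, not the three models compared above. The folds are contiguous and unshuffled, which is a simplification.

```python
import math


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """One-feature ordinary least squares."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return lambda x: mean_y + slope * (x - mean_x)


def cv_rmse(fit, xs, ys, k=5):
    """RMSE pooled over all held-out points across the k folds."""
    squared_error = 0.0
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)  # fit on the other k - 1 folds only
        squared_error += sum((ys[i] - model(xs[i])) ** 2 for i in fold)
    return math.sqrt(squared_error / len(xs))


# Toy data (hypothetical): an exactly linear relationship.
xs = [float(x) for x in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
scores = {"mean": cv_rmse(fit_mean, xs, ys), "line": cv_rmse(fit_line, xs, ys)}
best = min(scores, key=scores.get)
```

The model with the lowest cross-validated RMSE is selected, which is the same criterion used to prefer the random forest.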
The 10 most important features in the random forest model were vicinity to: clothing shop, bicycle parking, houseware shop, parking lot, bank, pharmacy, convenience store, fuel station, restaurant, and community centre. These are a mix of features from high streets and urban outskirts.
The ASB counts predicted by the random forest model are broadly similar to those of the kriging model, although with a slightly wider range.
The highest prediction, around 40 ASB cases, was in Chipping Barnet.
As with the kriging model, the random forest model was not able to predict the points with a high number of ASB incidents.
The highest error was observed in North Cricklewood, around Pennine Mansions. The point is located within blocks of flats and is in the vicinity of some shops and bus stations on the street.
Anti-social behaviour (N = 25,551) was the most prevalent type of crime in Barnet between April 2021 and March 2024. However, in the last 12 months (April 2023 to March 2024), violent crime (N = 7,217) was the most common type of crime, followed by anti-social behaviour (N = 6,912), which raises concerns.
Of all crime types, only anti-social behaviour demonstrated statistically significant spatial autocorrelation. That is, areas near locations where anti-social behaviour took place are also likely to experience anti-social behaviour. Changing the distance threshold used to classify neighbouring points did not change the finding.
Distances to the nearest place of interest of each type were summarised into seven components by PCA. Evaluation of the multivariate kriging model later showed that four components are sufficient for the model to perform. The four components primarily captured urban outskirts, high streets and healthcare settings.
The kriging model predicted counts of anti-social behaviour ranging from 1.3 to 25.5. When the predictions were compared with the actual counts, the model captured fairly well the locations where anti-social behaviour was infrequent. Nonetheless, it captured hot spots poorly. This could be due to high resident or transient population density in those areas.
Amongst the linear regression, support vector machine and random forest models, random forest performed best. Distances to the nearest clothes shop, bicycle parking space, houseware store, car parking lot, bank, pharmacy, convenience store, gas station, restaurant, and community centre were the ten most important features in the random forest model. However, as with the kriging model, the random forest model was not able to capture hot spots of anti-social behaviour. Given that the locations with the highest errors were either in blocks of flats or near a tube station, it is also likely that the model's underestimation arises from the lack of adjustment of crime counts for population density.
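One simple form such an adjustment could take is to model a rate per 1,000 people rather than a raw count. The sketch below only illustrates the arithmetic. The function name and the population (or footfall) figures are hypothetical, since no such data were used in this analysis; only the count of 163 is taken from the results above.

```python
def rate_per_1000(crime_count, population):
    """Crime count expressed per 1,000 people exposed at a location."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 1000.0 * crime_count / population


# Hypothetical exposure figures: a busy location and a quiet one.
busy_rate = rate_per_1000(163, 50_000)
quiet_rate = rate_per_1000(20, 2_000)
```

With these made-up exposures, the location with the higher raw count has the lower rate, which is why an unadjusted count model can mistake busy places for risky ones.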
Lack of adjustment for population density
Adjustment by population traffic with footfall data
Subset crime points around tube stations and adjust traffic volume with tap in and out data by TfL
Weak spatial autocorrelation
Data Quality
Crime data
Biased patrol patterns, leading to more crime being recorded near police stations
Retroactive update of crime count
Places data
May not include every place that exists on the ground
The kriging model was able to make predictions at locations where no anti-social behaviour was observed during the period of investigation.
Both the kriging and random forest models were able to identify cold spots of anti-social behaviour within the London Borough of Barnet. This may be of use to the Metropolitan Police in optimising resource allocation.
Lessons learned include:
Importance of version control & utilising saveRDS()
Importance of evaluation to test model performance